19 research outputs found

    TOMOBFLOW: feature-preserving noise filtering for electron tomography

    Abstract

    Background: Noise filtering techniques are needed in electron tomography to allow proper interpretation of datasets. Standard linear filtering techniques trade off the amount of noise removed against blurring of the features of interest. Sophisticated anisotropic nonlinear filtering techniques, by contrast, reduce noise while preserving structures well. However, these techniques are computationally intensive and difficult to tune to the problem at hand.

    Results: TOMOBFLOW is a noise-filtering program that preserves biologically relevant information. It is an efficient implementation of the Beltrami flow, a nonlinear filtering method that locally tunes the strength of the smoothing according to an edge indicator based on geometric properties. Because the method has no free parameters that are hard to tune, TOMOBFLOW is a user-friendly filtering program with the power of diffusion-based filtering methods. Furthermore, TOMOBFLOW handles different image types and formats, which makes it useful for electron tomography in particular and bioimaging in general.

    Conclusion: TOMOBFLOW allows efficient noise filtering of bioimaging datasets while preserving the features of interest, thereby yielding data better suited for post-processing, visualization and interpretation. It is available at the web site http://www.ual.es/%7ejjfdez/SW/tomobflow.html.
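The TOMOBFLOW abstract above describes smoothing whose strength is tuned locally by a geometric edge indicator. The sketch below illustrates that idea only; it is not TOMOBFLOW's code. It assumes a 1D signal and the commonly cited form of the Beltrami flow, I_t = (1/sqrt(g)) d/dx (I_x / sqrt(g)) with g = 1 + beta^2 I_x^2, discretised with a simple explicit scheme. The function names, `beta`, `dt` and the iteration count are hypothetical choices made for illustration.

```python
import math


def beltrami_step(signal, beta=1.0, dt=0.2):
    """One explicit step of a 1D Beltrami-flow sketch (zero-flux boundaries)."""
    n = len(signal)
    # flux[i] is the flux I_x / sqrt(g) across the face between samples i-1 and i.
    # flux[0] and flux[n] stay 0.0, which gives zero-flux (reflecting) boundaries.
    flux = [0.0] * (n + 1)
    for i in range(1, n):
        d = signal[i] - signal[i - 1]
        flux[i] = d / math.sqrt(1.0 + (beta * d) ** 2)

    out = []
    for i in range(n):
        # Central-difference gradient at sample i, with clamped neighbours.
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        grad = 0.5 * (right - left)
        # g grows with the local gradient, so 1/sqrt(g) acts as the edge
        # indicator: close to 1 in flat regions, small across strong edges.
        g = 1.0 + (beta * grad) ** 2
        out.append(signal[i] + dt * (flux[i + 1] - flux[i]) / math.sqrt(g))
    return out


def beltrami_filter(signal, iterations=50, beta=1.0, dt=0.2):
    """Apply the explicit step repeatedly; all parameters are illustrative."""
    current = list(signal)
    for _ in range(iterations):
        current = beltrami_step(current, beta, dt)
    return current


if __name__ == "__main__":
    # A step from 0 to 10 with alternating +/-0.5 "noise" on top.
    noisy = [(0.0 if i < 20 else 10.0) + (0.5 if i % 2 else -0.5) for i in range(40)]
    smooth = beltrami_filter(noisy)
    print([round(v, 2) for v in smooth])
```

In flat regions the indicator is close to 1 and the update behaves like ordinary diffusion, while across a large jump the flux saturates and the update is damped, which is the behaviour the abstract attributes to the method. A real tomogram would need a 3D discretisation, which this sketch does not attempt.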

    A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

    This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3D stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. This means exploring a large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the optimization tuning space and the programming effort for iterative multiple 3D stencils. Our results demonstrate that combinations that fuse the bottleneck stencil with a high halo update cost (> 25%; this percentage can be measured or estimated experimentally for each single stencil) and with high register and shared memory accesses should be excluded from the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the case in which we fully decompose our stencil without tiling, and 5.3× faster than the fully fused version on NVIDIA GPUs.
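To give a feel for the size of the fission/fusion space mentioned in the abstract above, the following sketch enumerates every way of grouping a linear pipeline into contiguous fused kernels, each with and without tiling. It assumes that only adjacent stages can be fused, which the abstract does not state; the stage names and function names are hypothetical.

```python
from itertools import product


def fusion_plans(stages):
    """All ways to split a non-empty linear pipeline into contiguous fused groups."""
    plans = []
    # One boolean per boundary between adjacent stages: True = cut (fission),
    # False = keep the two stages in the same kernel (fusion).
    for cuts in product([False, True], repeat=len(stages) - 1):
        groups = [[stages[0]]]
        for stage, cut in zip(stages[1:], cuts):
            if cut:
                groups.append([stage])
            else:
                groups[-1].append(stage)
        plans.append(groups)
    return plans


def search_space(stages):
    """Pair every fusion plan with both tiling choices."""
    return [(plan, tiled) for plan in fusion_plans(stages) for tiled in (False, True)]


if __name__ == "__main__":
    stages = ["s1", "s2", "s3", "s4"]  # hypothetical stage names
    for plan, tiled in search_space(stages):
        print(plan, "tiled" if tiled else "untiled")
```

Under this assumption a pipeline of n stages has 2^(n-1) fusion plans, doubled by the tiling choice, so pruning rules of the kind the abstract reports quickly become worthwhile as n grows.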

    Demystifying the 16 × 16 thread-block for stencils on the GPU

    Summary: Stencil computation is of paramount importance in many fields, including image processing, structural biology and biomedicine. There is a constant demand to maximize the performance of stencils on state-of-the-art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is selecting the thread-block that maximizes overall performance. Usually, programmers look for the optimal thread-block configuration in a reduced space of square thread-block configurations, or simply use the best configuration reported in previous works, which is usually 16 × 16. This paper provides a better understanding of the impact of thread-block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations lie within the space that provides: (1) a small number of global memory communications; (2) good shared memory utilization with few conflicts; (3) good utilization of the streaming multiprocessors; and (4) high efficiency of the threads within a thread-block. The model determines the set of optimal thread-block configurations without executing the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve a throughput of at least 75% of that of the best configuration, and the space is guaranteed to include the best configurations.
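As a rough illustration of why the thread-block shape matters for stencils, the sketch below enumerates candidate 2D thread-block shapes and scores each one by the fraction of extra halo cells a tile must load. This is not the model from the paper: it covers only a crude proxy for the first criterion listed in the abstract above (global memory traffic), and it assumes power-of-two block dimensions, a limit of 1024 threads per block and a warp size of 32. All function names are hypothetical.

```python
def candidate_blocks(max_threads=1024, warp_size=32):
    """Power-of-two (bx, by) shapes whose thread count fills whole warps."""
    sizes = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    return [
        (bx, by)
        for bx in sizes
        for by in sizes
        if bx * by <= max_threads and (bx * by) % warp_size == 0
    ]


def halo_overhead(bx, by, halo):
    """Extra cells loaded for the halo, relative to the cells actually computed."""
    tile = bx * by
    padded = (bx + 2 * halo) * (by + 2 * halo)
    return (padded - tile) / tile


def rank_by_halo(halo):
    """Candidate shapes ordered from lowest to highest halo overhead."""
    return sorted(candidate_blocks(), key=lambda b: halo_overhead(b[0], b[1], halo))


if __name__ == "__main__":
    for bx, by in rank_by_halo(halo=1)[:5]:
        print(bx, by, round(halo_overhead(bx, by, 1), 3))
```

On its own this criterion simply favours large, square tiles. The abstract lists three further criteria (shared memory conflicts, multiprocessor utilization and thread efficiency) that this sketch does not model, and they are presumably what makes the best shape depend on the stencil.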